Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations for frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
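As an illustration of the hierarchical contrastive objective described above, the sketch below sums a symmetric InfoNCE loss over the three granularity pairs (video-sentence, clip-phrase, frame-word). The function names, the temperature value, and the equal default weights are our own assumptions, not details from the paper.

```python
import numpy as np

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (rows aligned)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                                # (B, B) similarity matrix
    idx = np.arange(len(a))
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    # Matched pairs sit on the diagonal; average both retrieval directions.
    return 0.5 * (-log_p[idx, idx].mean() - log_p_t[idx, idx].mean())

def hierarchical_loss(video, clip, frame, sent, phrase, word, weights=(1.0, 1.0, 1.0)):
    """Contrastive losses at video-sentence, clip-phrase, frame-word levels."""
    w1, w2, w3 = weights
    return (w1 * info_nce(video, sent)
            + w2 * info_nce(clip, phrase)
            + w3 * info_nce(frame, word))
```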
Generalized Zero-Shot Learning (GZSL) aims to recognize novel categories with the help of auxiliary semantic information, e.g., category attributes. In this paper, we tackle the critical issue of domain shift, i.e., the confusion between seen and unseen categories, by progressively improving the cross-domain transferability and category discriminability of visual representations. Our method, named Dual Progressive Prototype Network (DPPN), constructs two types of prototypes that record prototypical visual patterns for attributes and categories, respectively. With attribute prototypes, DPPN alternately searches for attribute-related local regions and updates the corresponding attribute prototypes, progressively exploring accurate attribute-region correspondences. This enables DPPN to produce visual representations with accurate attribute-localization ability, which benefits semantic-visual alignment and representation transferability. Besides progressive attribute localization, DPPN further projects category prototypes into multiple spaces to progressively repel visual representations of different categories, which boosts category discriminability. Both attribute and category prototypes are collaboratively learned in a unified framework, making the visual representations of DPPN transferable and distinctive. Experiments on four benchmarks prove that DPPN effectively alleviates the domain shift problem in GZSL.
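The alternating attribute-localization step can be pictured as attention between region features and an attribute prototype, followed by a momentum refresh of the prototype. This is a minimal NumPy sketch under assumed conventions (cosine attention, EMA update); the names `localize_attribute` and `update_prototype` are hypothetical, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def localize_attribute(regions, prototype, tau=0.1):
    """Attend over R region features (R, D) with one attribute prototype (D,).
    Returns attention weights and the attribute-specific pooled feature."""
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    p = prototype / np.linalg.norm(prototype)
    attn = softmax(r @ p / tau)    # (R,) relevance of each region to the attribute
    pooled = attn @ regions        # attention-weighted region feature
    return attn, pooled

def update_prototype(prototype, pooled, momentum=0.9):
    """EMA-style prototype refresh toward the attended visual feature."""
    return momentum * prototype + (1 - momentum) * pooled
```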
The centrality and diversity of the labeled data strongly influence the performance of semi-supervised learning (SSL), yet most SSL models select labeled data at random. How to guarantee the centrality and diversity of the labeled data has so far received little research attention. The optimal leading forest (OLF) has been observed to have the advantage of revealing the difference evolution within a class, which is useful for developing SSL models. Our key intuition in this study is to learn a kernelized large-margin metric from a small number of the most stable and most divergent data points identified from the OLF structure; an optimization problem is formulated to achieve this goal. Also based on OLF, multiple local metric learning is facilitated to address the multi-modal and mixed-modal problems in SSL. Thanks to this novel design, the accuracy and performance stability of OLF-based SSL models are significantly improved over baseline methods without sacrificing much efficiency. Experimental studies show that the proposed method achieves encouraging accuracy and running time compared with state-of-the-art graph-based SSL methods. The code is available at https://github.com/alanxuji/delala.
In recent years, the flourishing of deep learning has witnessed the rapid development of text recognition. However, existing text recognition methods are mainly designed for English text, ignoring the pivotal role of Chinese text. As another widely spoken language, Chinese text recognition has a broad application market in various scenarios. According to our observations, we attribute the scarce attention paid to Chinese text recognition to the lack of reasonable dataset construction standards, unified evaluation methods, and results for existing baselines. To fill this gap, we manually collect Chinese text datasets from publicly available competitions, projects, and papers, and divide them into four categories: scene, web, document, and handwriting datasets. Furthermore, we evaluate a series of representative text recognition methods on these datasets with a unified evaluation method to provide experimental results. By analyzing these results, we surprisingly observe that state-of-the-art baselines for recognizing English text do not perform well on Chinese scenarios. We believe numerous challenges still remain due to the characteristics of Chinese text, which differ greatly from English text. The code and datasets are publicly available at https://github.com/fudanvi/benchmarking-chinese-text-recognition.
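Text recognition benchmarks of this kind are commonly scored with sequence accuracy (exact match) and normalized edit distance; whether these are the exact metrics of the unified evaluation above is an assumption on our part. A self-contained sketch:

```python
def levenshtein(a, b):
    """Edit distance between two strings via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, substitution (free if characters match)
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev = cur
    return dp[n]

def evaluate(preds, gts):
    """Sequence accuracy (exact match) and mean normalized edit distance."""
    acc = sum(p == g for p, g in zip(preds, gts)) / len(gts)
    ned = sum(levenshtein(p, g) / max(len(p), len(g), 1)
              for p, g in zip(preds, gts)) / len(gts)
    return acc, ned
```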
In this paper, we discover a two-phase phenomenon in the learning of multi-layer perceptrons (MLPs). Namely, in the first phase, the training loss does not decrease significantly, but the similarity of features between different samples keeps increasing, which hurts feature diversity. We explain this two-phase phenomenon in terms of the learning dynamics of the MLP. Furthermore, we propose two normalization operations that eliminate the two-phase phenomenon, which avoids the decrease of feature diversity and speeds up the training process.
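The rising cross-sample feature similarity described above can be monitored with the mean pairwise cosine similarity over a batch of features. This monitoring utility is our own sketch, not code from the paper:

```python
import numpy as np

def mean_pairwise_cosine(features):
    """Average cosine similarity between features of *different* samples (B, D).
    A value climbing toward 1 during training signals collapsing feature
    diversity, as in the first phase described above."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                            # (B, B) cosine-similarity matrix
    off_diag = sim[~np.eye(len(f), dtype=bool)]  # drop self-similarities
    return off_diag.mean()
```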
As a crucial cue for understanding human intention, human gaze provides a key signal for Human-Computer Interaction (HCI) applications. Appearance-based gaze estimation, which directly regresses the gaze vector from eye images, has recently made great progress based on Convolutional Neural Network (ConvNet) architectures and open-source large-scale gaze datasets. However, encoding model-based knowledge into CNN models to further improve gaze estimation performance remains a topic to be explored. In this paper, we propose HybridGazeNet (HGN), a unified framework that explicitly encodes a geometric eyeball model into an appearance-based CNN architecture. Composed of a multi-branch network and an uncertainty module, HybridGazeNet is trained using a hybrid strategy. Experiments on multiple challenging gaze datasets show that HybridGazeNet has better accuracy and generalization ability compared with existing SOTA methods. The code will be released later.
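Gaze estimation results are typically reported as the angular error between predicted and ground-truth gaze directions. A minimal sketch, assuming one common pitch/yaw-to-vector convention (not necessarily the one used by HybridGazeNet):

```python
import numpy as np

def pitchyaw_to_vector(pitch, yaw):
    """Convert gaze angles (radians) to a unit 3D gaze direction.
    Convention assumed here: x right, y up, z out of the face."""
    return np.array([np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch),
                     np.cos(pitch) * np.cos(yaw)])

def angular_error_deg(g1, g2):
    """Angle in degrees between two gaze vectors, the standard gaze metric."""
    cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```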
This paper proposes a method to visualize the discrimination power of intermediate-layer visual patterns encoded by a DNN. Specifically, we visualize (1) how the DNN gradually learns regional visual patterns in each intermediate layer during training, and (2) the effect of the DNN using non-discriminative patterns in low layers to construct discriminative patterns in middle/high layers through forward propagation. Based on our visualization method, we can quantify the knowledge points (i.e., the number of discriminative visual patterns) learned by the DNN, so as to evaluate its representation capacity. The method also provides new insights into the signal-processing behaviors of existing deep-learning techniques, such as adversarial attacks and knowledge distillation.
We study how to construct a set of policies that can be composed to solve a collection of reinforcement learning tasks. Each task is a different reward function, defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators such as generalized policy improvement. Our main contribution is a policy iteration algorithm that builds a set of policies to maximize the worst-case performance of the resulting SMP. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMPs strictly improves at each iteration, and that the algorithm stops only when no policy exists that would lead to improved performance. We empirically evaluate our algorithm on a grid world and on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of the algorithm. Interestingly, we also show empirically that the policy sets computed by the algorithm are diverse, leading to different trajectories in the grid world and to qualitatively distinct locomotion skills in the control suite.
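A set-max policy can be sketched concretely for tabular Q-functions: at each state, pick the policy in the set with the highest value there and follow its greedy action. The function below is an illustrative simplification of that idea, not the paper's implementation.

```python
import numpy as np

def smp_action(q_values, state):
    """Set-max policy over a set of policies given their action-value tables.

    q_values: list of (S, A) arrays, one greedy Q-table per policy in the set.
    Acts greedily with respect to whichever policy is best at this state."""
    values = [q[state].max() for q in q_values]   # value of each policy here
    best = int(np.argmax(values))                 # the max over the set...
    return int(q_values[best][state].argmax())    # ...defines the action taken
```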
The growing interest in intelligent services and privacy protection for mobile devices has given rise to the widespread application of federated learning in Multi-access Edge Computing (MEC). Diverse user behaviors call for personalized services with heterogeneous Machine Learning (ML) models on different devices. Federated Multi-task Learning (FMTL) has been proposed to train related but personalized ML models for different devices, but previous works suffer from excessive communication overhead during training and neglect the model heterogeneity among devices in MEC. Introducing knowledge distillation into FMTL can simultaneously enable efficient communication and model heterogeneity among clients, yet existing methods rely on a public dataset, which is impractical in reality. To tackle this dilemma, Federated MultI-task Distillation for Multi-access Edge CompuTing (FedICT) is proposed. FedICT keeps local and global knowledge apart during the bi-directional distillation processes between clients and the server, aiming to support multi-task clients while alleviating the client drift caused by the divergent optimization directions of client-side local models. Specifically, FedICT includes Federated Prior Knowledge Distillation (FPKD) and Local Knowledge Adjustment (LKA). FPKD reinforces the clients' fitting of local data by introducing prior knowledge of the local data distributions, while LKA corrects the distillation loss of the server so that the transferred local knowledge better matches the generalized representation. Experiments on three datasets show that FedICT significantly outperforms all compared benchmarks under various data-heterogeneity and model-architecture settings, achieving improved accuracy with less than 1.2% of the training communication overhead of FedAvg and no more than 75% of the training communication rounds of FedGKT.
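The logit-distillation building block underlying such bi-directional client-server exchanges is a temperature-scaled KL divergence; the sketch below shows that loss in NumPy. The temperature value and the direction of the KL are assumptions, and this is a generic distillation loss, not FedICT's exact FPKD/LKA formulation.

```python
import numpy as np

def softmax(z, tau=1.0):
    e = np.exp(z / tau - (z / tau).max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """Temperature-scaled KL(teacher || student) for logit distillation.
    Each side of a bi-directional exchange treats the other as the teacher."""
    p = softmax(teacher_logits, tau)   # softened teacher distribution
    q = softmax(student_logits, tau)   # softened student distribution
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * tau ** 2)
```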
In this paper we study the smooth strongly convex minimization problem $\min_{x}\min_y f(x,y)$. Existing optimal first-order methods require $\mathcal{O}(\sqrt{\max\{\kappa_x,\kappa_y\}} \log 1/\epsilon)$ computations of both $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$, where $\kappa_x$ and $\kappa_y$ are the condition numbers with respect to the variable blocks $x$ and $y$. We propose a new algorithm that requires only $\mathcal{O}(\sqrt{\kappa_x} \log 1/\epsilon)$ computations of $\nabla_x f(x,y)$ and $\mathcal{O}(\sqrt{\kappa_y} \log 1/\epsilon)$ computations of $\nabla_y f(x,y)$. In some applications $\kappa_x \gg \kappa_y$, and computing $\nabla_y f(x,y)$ is significantly cheaper than computing $\nabla_x f(x,y)$; in this case, our algorithm substantially outperforms the existing state-of-the-art methods.
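Dropping constants, the two stated bounds can be compared numerically for an assumed ill-conditioned setting with $\kappa_x \gg \kappa_y$; the concrete values below are illustrative only, not from the paper.

```python
import math

def grad_calls(kappa, eps):
    """O(sqrt(kappa) * log(1/eps)) gradient computations, constants dropped."""
    return math.sqrt(kappa) * math.log(1.0 / eps)

# Illustrative (assumed) conditioning: kappa_x >> kappa_y.
kx, ky, eps = 1e6, 1e2, 1e-6
# Existing methods: both gradients paid at the worse of the two rates.
existing = 2 * grad_calls(max(kx, ky), eps)
# Proposed method: each gradient paid at its own rate.
proposed = grad_calls(kx, eps) + grad_calls(ky, eps)
```

With these values the expensive $\nabla_y$ computations drop by a factor of roughly $\sqrt{\kappa_x/\kappa_y}$, which is where the claimed advantage comes from.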